
Full text search using Elasticsearch for blazingly fast search

Now imagine it is the year 2005 and you are a software engineer at an e-commerce company. It's 2005, the web is booming, and the company is growing rapidly. You have a task: your company has around 5,000 products, and you have to write a database query and build an API that takes a user's input, searches through your database, and returns relevant results.

In a typical relational database setup, you write something like SELECT * FROM products WHERE name ILIKE '%laptop%' OR description ILIKE '%laptop%'. The percentage symbols match any characters that come before or after the term, and the condition can match either the name or the description. If the keyword laptop is in either the name or the description, you return those results. This is how you do a basic search: customers search for laptop, they get a few results, and life is simple.

Then something happens. Since it is 2005 and everything is growing, your company also grows, and it grows rapidly. Suddenly you have millions of products. The straightforward ILIKE, percentage-symbol search in your relational database that once returned

results in, let's say, 50 milliseconds is now taking 30 seconds. Your customers are frustrated, your manager is frustrated, everyone is asking you to optimize it, and you are clueless. And they don't just want it faster; they have more requirements. Requirements like: can we make the search smarter? For example, when someone searches for laptop, you want to show the most relevant results first: instead of showing a laptop bag, we want to show a MacBook Pro first. That kind of smart, relevance-based searching. We also know that customers are in a hurry during sale time, so instead of typing laptop they pretty frequently mistype it, say as laptap. We want to take that into account: even when customers make typos in their queries, we still want to return relevant results for laptop. So we want search that is fast, that is relevant, and that is robust enough that typos cannot break it. With all these requirements, this is the story of why search engines like Elasticsearch came into being, and this is what we are going to build intuition for. Your Postgres database,

or whatever relational database you are using, is like a librarian. You go to a librarian and ask for the locations of different books in different categories. Your Postgres database is like a librarian who knows exactly where every book is located, but it has one fatal flaw: if you ask for a random book, then to find it the librarian has to look through every single book on every single shelf in the whole library, one by one.

For example, say you go to the library and tell the librarian you are looking for books about machine learning. The librarian starts the search. The first book it finds is Harry Potter and the Philosopher's Stone, and it checks: no, there is no machine learning here. Moving on. Then it finds Game of Thrones; again, no machine learning in that title. Moving on. Then it finds a book called Introduction to Machine Learning, finally has a match, and returns it. The librarian continues this process, going through every book one by one in the whole library. Depending on the size of the library, it might take a few minutes to a few hours, or in

fact a few days for a huge library, say one with 10 million or a billion books. That is the first problem: the time it takes to find our books. The second problem is that the librarian has no concept of relevance. What do we mean by relevance? Say the librarian finds two books. One is titled An Introduction to Machine Learning. In the other, the phrase machine learning is only mentioned on the last page, and the book is about something else entirely. The librarian returns both books in whatever order it wants: it can return the barely relevant second book first and An Introduction to Machine Learning second. But if you are asking for books about machine learning, you want the book titled An Introduction to Machine Learning first, and you might not want the second book at all, because it is pretty irrelevant for our use case.

Coming back to our databases: if you think of the librarian from this example as the database itself, then a query using ILIKE, which is a case-insensitive match, together with the percentage symbols is exactly how the librarian looks for books in a huge library. The database has to scan every single row, examine every single text field, and perform pattern matching character by character. It is thorough; it will return every matching result. But it is painfully slow, and, the second problem, it has no sense of relevance. It does not really know which results are more important than others.
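Just to make the librarian-style scan concrete, here is a small sketch (illustrative only, with made-up product data, not code from any real project) of what `ILIKE '%laptop%'` effectively does: walk every row and run a case-insensitive substring check on each text field.

```javascript
// Toy product catalog standing in for the products table.
const products = [
  { name: "MacBook Pro", description: "A powerful laptop for professionals" },
  { name: "Laptop Bag", description: "Padded sleeve for 15-inch machines" },
  { name: "Coffee Mug", description: "Ceramic, 300ml" },
];

// Roughly equivalent to:
//   SELECT * FROM products
//   WHERE name ILIKE '%laptop%' OR description ILIKE '%laptop%';
// Every row is scanned and every field checked -- O(rows) work per query.
function naiveSearch(rows, term) {
  const t = term.toLowerCase();
  return rows.filter(
    (r) =>
      r.name.toLowerCase().includes(t) ||
      r.description.toLowerCase().includes(t)
  );
}

console.log(naiveSearch(products, "laptop").map((r) => r.name));
// Matches both "MacBook Pro" (term in description) and "Laptop Bag"
// (term in name), in table order -- with no notion of which is more relevant.
```

Notice that the results come back in storage order: the scan has no way to say that one match is better than another, which is exactly the relevance problem described above.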

It returns them in an arbitrary order. Some results might be completely irrelevant, and some might be very relevant but sitting at position 1,000 or 10,000; the database has no sense of which result is the most relevant.

Now, coming back to our history: around 2005, the information explosion was happening. So many websites were being built, so many servers and databases. Big companies like Google were crawling and indexing billions of web pages, Amazon, as an e-commerce company, was cataloging millions of products, and LinkedIn was indexing millions of profiles. These companies could not afford to wait 30 seconds for search results; customers would simply leave the site. Thirty seconds is way too much. By current standards, even a 2-second delay is considered crazy when we are talking about search latency; search results have to be fast, and we are talking milliseconds here. These companies were making millions of dollars every second, and that kind of latency would hurt their conversion and their user base.

To solve the search problem, the answer came from decades of research. It was not a new problem: there has been countless research in text search and information retrieval, with computer scientists studying how to get results fast and keep them relevant, and they have been investing time and effort into it since the 1960s. The revolution came from one key idea: instead of searching through the documents, the way the librarian searches through the books, their titles, and their contents to find the relevant terms, what if we flipped the problem? What if we inverted it and looked at it the other way around? That was the birth of the key invention that changed text search forever, which is called the inverted index.
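As a minimal sketch of that flip (illustrative only, using the book titles from the library analogy): build the index once, at the moment documents are stored, mapping each term to the set of documents containing it; a query then becomes a lookup instead of a scan.

```javascript
// Each document gets an id; the inverted index maps term -> set of doc ids.
const books = [
  { id: 1, title: "Introduction to Machine Learning" },
  { id: 2, title: "The Machine Age" },
  { id: 3, title: "Learning to Cook" },
];

// Build the inverted index once, when documents are stored.
function buildIndex(docs) {
  const index = new Map();
  for (const doc of docs) {
    // Lowercase and split on non-word characters: a crude "analyzer".
    for (const term of doc.title.toLowerCase().split(/\W+/)) {
      if (!term) continue;
      if (!index.has(term)) index.set(term, new Set());
      index.get(term).add(doc.id);
    }
  }
  return index;
}

const index = buildIndex(books);
// A query is now a direct lookup, not a scan over every document.
console.log([...(index.get("machine") ?? [])]);  // ids of books with "machine"
console.log([...(index.get("learning") ?? [])]); // ids of books with "learning"
```

A real index like Lucene's also stores positions and frequencies per document, which is what the page numbers in the library analogy below correspond to; this sketch only records which documents contain each term.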

Now, this is a very simple concept. The implementation has more complicated parts, more mathematics, but for a high-level understanding it is simple. Coming back to our earlier example, where we think of the database as a librarian: we ask the librarian for books related to machine learning, and the librarian goes through every book, searching the title, the content, the description, and so on. Let's look at the same problem, but this time using the inverted index. The revolutionary idea is this: instead of going through every book and checking whether machine learning appears in it, what if, when a book first arrives and we store it on the shelf, we take all the words in that book and build an index of those words, such that for any particular word we can look up all the books where that word is used, and where exactly it is used?

For example, take the word machine. Machine is used in a book called Introduction to Machine Learning on pages 1, 15, and 23. Likewise, in another book called The Machine Age it is used on pages 5 and 89, and in a book called Coffee Machine Manual it is used on page 1. The same goes for learning: in Introduction to Machine Learning it is used three times, on pages 1, 16, and 24. (In practice, for a book called Introduction to Machine Learning, the words machine and learning would most likely appear on hundreds of pages; we are using three for the sake of the example.) Likewise, Learning to Cook uses it in two places and Deep Learning Fundamentals in three. Now we have this index, which we call an inverted index. Instead of going through the content to find the terms, we start from the terms and find the content; we just inverted the search, which is why it is called an inverted index. And this technique is what powers Elasticsearch, a very famous tool. When we are talking about full text search,

Elasticsearch also has a history; it is not a completely new invention. It builds on a technology called Apache Lucene, the key inverted-index-based library that Elasticsearch and many other tools use. Elasticsearch is not the only full-text search tool; in fact, these days even relational databases like Postgres have support for full-text search. So there are plenty of full-text search tools, and most of them use Apache Lucene underneath, which is primarily inverted-index-based text search technology.

That is one key concept. Now that we have this inverted index, the librarian can simply refer to it: we say machine learning, and the librarian can see that machine is used in one set of books and learning in another. (We keep talking about the librarian and the database together just to keep things intuitive; when we say librarian, think database.) So the librarian is faster now, but there is another huge advantage: relevance. From these results we can see that the term machine appears in three books. As I said, in a book called Introduction to Machine Learning the term machine is probably used in hundreds of places, but for the sake of this example it is used in three, and for the book The Machine Age

it is used in two places. Now, tools like Elasticsearch also have a feature called relevance scoring. You don't have to understand exactly how it works internally; understanding the underlying architecture is definitely good, but if you just want to use the tool and build a service, you can refer to the docs, take an example, and use it. You just have to know that when you have a use case like this, you go with Elasticsearch or with full-text search in Postgres, whatever your typical stack is. You don't have to completely understand everything happening behind the scenes, how the index works, how the scoring works; it is pretty complicated, and unless you are writing another library like Elasticsearch, or an alternative to it, all that knowledge will not help you much, as long as you know the best practices for the common tasks.

Coming back: Elasticsearch has this concept of relevance scoring, which means it can see that the term machine is present on page 1, in the title of the book. The very fact that the term appears in the title gives a significant boost to the relevance score; this is the most relevant book for our query. On top of that, the term is present three times throughout the book, which is another significant relevance booster.
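To make the title-boost-plus-frequency intuition concrete, here is a deliberately toy scoring function (this is not BM25 and not Elasticsearch's actual formula; the document set, the boost value of 5, and the scoring rule are all made up for illustration):

```javascript
// Count occurrences of a term in a text (case-insensitive, whole words).
function tf(text, term) {
  return text.toLowerCase().split(/\W+/).filter((w) => w === term).length;
}

// Toy score: a hit in the title is worth far more than a hit in the body.
// Real Elasticsearch scoring is BM25 combined with configurable field boosts.
function score(doc, term, titleBoost = 5) {
  return titleBoost * tf(doc.title, term) + tf(doc.body, term);
}

const docs = [
  { title: "Introduction to Machine Learning",
    body: "A machine can learn from data. Machine models improve with data." },
  { title: "The Machine Age",
    body: "A history of industry and the machine." },
  { title: "Coffee Machine Manual",
    body: "How to descale your brewer." },
];

// Rank by descending score for the term "machine".
const ranked = [...docs].sort((a, b) => score(b, "machine") - score(a, "machine"));
console.log(ranked.map((d) => d.title));
// The book with the term in the title AND the most body occurrences comes first.
```

All three toy books have machine in the title, so the in-body frequency breaks the tie, mirroring the ordering described above: Introduction to Machine Learning first, The Machine Age second, Coffee Machine Manual third.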

It is present in the title, and it appears frequently; because of those two things, this result came out on top. Likewise, for the book The Machine Age, the term is present in the title, which contributes to the relevance score, but it is not used as frequently, so it comes second. Same for the third result: present in the title, but used even less frequently, so it lands in third place. And the same logic applies to the other term. So as you can see, using a tool like Elasticsearch makes search fast and also gives you this additional advantage of relevance-based results. You don't want just any result that matches your query; you want the most meaningful results, and tools like Elasticsearch give you that. Under the hood there is an algorithm called BM25. As I said, there is a lot of theory behind it; if you are interested, you can learn more. But I would suggest treating it as a tool: know that for a use case like this you go with Elasticsearch, then refer to the docs, implement the feature, and move on, instead of spending too much time on the theory. If you are curious, though, the Elasticsearch docs are a very good place to learn more about these things. This algorithm it uses, the BM25

algorithm, takes different parameters into account to rank results. The first, which we already touched on, is term frequency: how often the term machine appears in a document. (In Elasticsearch, a single entity is called a document; it's a JSON document, somewhat like in MongoDB. So if we treat each book as a document, term frequency checks how often a particular term appears in that document.) The second is document frequency: how common the term machine is across all documents. So one is within a single document and the other is across all documents. The third is document length: how long a particular document is, whether it is short or a long, book-like document. Then there is field boosting. If you have worked with Elasticsearch before, field boosting is a feature we use a lot. Field boosting means that if the term machine appears in the title, that is more relevant than it appearing in the description, and appearing in the description is more relevant than appearing in the content. So for a particular book, the title outranks the description, and the description outranks the content. And when making a search query, we can define our own field-boosting criteria; we could say that appearing in the content should count more. These things can be altered; nothing is fixed, it's up to us. The Elasticsearch DSL, the language we use, is a JSON-based query language

and it offers a lot of features and parameters for different kinds of search, letting us extract different kinds of results. There are different use cases for Elasticsearch beyond basic text matching, the first being typo tolerance. Let me show an example with Google search. We are not sure exactly what technology Google uses, whether it is Elasticsearch or something proprietary built on top of Apache Lucene, but it is a similar technology, and you can build an interface like this, called a type-ahead interface, using Elasticsearch. We have this on sites like Amazon, where you type something and get suggestions. Say I want to search what is trending today; as soon as I start typing what is, I already get a lot of suggestions, scored on parameters that are up to Google. But I want to show something else. Instead of writing trending, what if I write treading: what is treading today. As you can see, we intentionally made a typo, but thanks to the full-text search capabilities of tools like Elasticsearch, they can derive from context that there is a typo and that what is trending today is most likely the user's intended query. That is a major advantage of full-text search technology like Elasticsearch for these kinds of experiences, typos and all. So if you are building a search-like feature in your backend application, you have two options. You can go with Postgres, which as a modern database also offers full-text search. Or, if your company already uses

Elasticsearch, that's another reason to pick it. Elasticsearch is not used only for features like type-ahead and full-text search; it is also very widely used in the domain of log management. There is a famous stack called the ELK stack: Elasticsearch, Logstash, and Kibana, three technologies combined into a very popular stack for managing logs and so on, since Elasticsearch is very fast at searching through logs, deriving statistics, visualizing data, and so forth. So if your company already runs Elasticsearch as part of an ELK stack for log management, using it for your full-text search requirements also makes a lot of sense, instead of going with Postgres.

Let's go through an example of the difference between a traditional database search and an Elasticsearch-based search. I have a project here; it is a Next.js project. There is no particular reason for Next.js, other than that LLMs are very good at generating these prototype applications with it, so I went with that instead of writing my own in another language. The project has a UI where we execute a query against both a traditional Postgres database and an Elasticsearch instance. To keep it fair, I created a project in Neon, a serverless cloud-based Postgres,

and a project in Elastic Cloud, and both instances are located in the us-west region, so latency due to distance should not be a big factor in the benchmark we are about to run. Before that, let me explain what I have done in this project for the demo. I created a table called reviews in the Neon Postgres database with three fields: an id, a review, which is a text field, and a sentiment, which is positive or negative. Then there is a script that populates both the database and the Elasticsearch index. I have a very big CSV file, around 50,000 entries of review and sentiment; I collected it from somewhere, I'm not sure of the source. It has two fields: a review, and a sentiment saying whether the review is positive or negative. We want to insert all 50,000 entries into both the database and the Elasticsearch index. So what is the populate script doing? It's a Node.js script. I take the database URL, the Elasticsearch address, and the Elasticsearch API key from the environment; if any of them is missing, I throw an error. Then we initialize the clients: for the database we create a Neon client, and for

Elasticsearch an Elasticsearch client. Then we read the data from the CSV with a readFileSync call and store it in a variable. First we populate the Postgres instance. The script says: if the table does not already exist, create it; since I have already run this migration once, the table already exists and this part does not actually run. Then we reset the id sequence so that we start fresh. Then we do some filtering so that every record has both the review and the sentiment field; if it doesn't, we filter it out. With the valid records in hand, we insert them in batches, since inserting them all at once hits a limit on the Neon instance. We create batches of a thousand, loop through those batches, and run a normal INSERT statement: insert into reviews (review, sentiment) with the corresponding values, executing the query for each batch. That's the database part. Then we populate Elasticsearch. First we check whether the index, which is called reviews, exists; if it already exists, we delete it, since we want to start fresh. Then here we are creating an

index, and inside that index, here is how we map the fields. We say we have two fields: review, of type text, and sentiment, of type keyword. Text means the value gets analyzed, broken into individual words for full-text search, while keyword means we want to match the value exactly, as a whole. Again we filter records for the presence of both fields, and then we do a bulk insert of all 50,000 records into Elasticsearch. I have already run this script; you can see "50,000 documents inserted to Elasticsearch" and so on, and if we check our Elasticsearch cloud instance, it shows 50,000 documents, all populated. Same for the Neon database: if we open the SQL editor and run select count(*) from reviews, we get 50,000, meaning there are 50,000 rows in the reviews table. So both the database and Elasticsearch have been populated successfully.

Now for the key part, not the UI; the UI is pretty generic. The route: this is the API endpoint we call once the user types something and hits enter to run the query. What exactly are we doing here? We take the search term from the JSON body, and if something is wrong with it, we throw a bad-request error. Then we start a stream, because the result

timing of the database and Elasticsearch can differ, and we don't want to delay one result because of the other. As results become available, we stream them to the frontend, so we can immediately see which one is faster and how long each takes. For that we start a new stream here. Then comes the Postgres search. We start a timer to track how long the query takes, and this is the query we run: select these three fields from the reviews table where review ILIKE the search term, as I showed a while back. This is how you typically search in a relational database: we use ILIKE for a case-insensitive search to cover more breadth, and we put a percentage sign before and after the term, meaning it doesn't matter what characters come before or after it; as long as the term itself is present, we take the result. We run the query, take the information that comes back, and send the result through the function that actually streams the data to the frontend as it becomes available. Second, the Elasticsearch search. Same idea: we run a search against the Elasticsearch index, whose name is reviews, searching with a query string. In the query we convert the search term to lowercase to cover as much breadth as possible. And whatever characters

come before and after it, we accept that, and we add some default fields to enhance the search. Again, we execute the search and return whatever data comes back, and as a response we return the stream as it becomes available. On the frontend we just read the response as it streams in; there is nothing special on the frontend side.

Now, moving on to the demo. Say I run a query for the keyword laptop and hit search. The Elasticsearch results take around 1 second, and the database search takes around 3 seconds, almost 4. Let's run something else, say the word something, and hit search: the Elasticsearch results are ready in 500 milliseconds, with around 8,000 matches. The database finds the same number of results, because to keep it fair we keep the search criteria the same, lowercasing the term and matching any characters before and after; in this demo we just want to compare speed. If we run it once more, the Elasticsearch results are again here in 500 milliseconds while the Postgres search is still running, and it took

around 7.5 seconds for the database results to come in. Even though the number of results is the same, the time taken is significantly larger in a relational database with the ILIKE-based syntax. So this is what I wanted to show: when you have any kind of search or type-ahead use case, go with full-text search, either Postgres full-text search or a tool like Elasticsearch. It's something you should have in your arsenal as a backend engineer. You don't have to master it; knowledge of Elasticsearch is not as important as knowledge of databases. Databases are something you absolutely have to master: how to work with them, how to optimize them, how indexes work, every single part of it, because that touches almost 99% of your codebase as a backend engineer. But Elasticsearch is something you can get away with just by copy-pasting snippets from an LLM or the docs. Of course, if you want to optimize, you have to read a little deeper, but for most use cases the examples provided in the docs and in those snippets are more than enough to cover most search-based requirements. That's pretty much all about full-text search and Elasticsearch.
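As one last copy-paste-able sketch: the demo's exact query isn't reproduced here, but a typical Elasticsearch search body with typo tolerance looks roughly like this. The `reviews` index name matches the demo; the use of a `match` query with `fuzziness: "AUTO"` is an assumption about one reasonable way to get "treading" to match "trending", not the demo's actual request.

```javascript
// Build the JSON body you would send to POST /reviews/_search.
// "fuzziness": "AUTO" tells Elasticsearch to tolerate small typos
// (edit distance scaled to term length), so a query like "treading"
// can still match documents containing "trending".
function buildSearchBody(term) {
  return {
    query: {
      match: {
        review: {
          query: term,
          fuzziness: "AUTO",
        },
      },
    },
    size: 20, // cap the number of hits returned
  };
}

console.log(JSON.stringify(buildSearchBody("treading today"), null, 2));
```

With the official Node.js client, a body like this would be passed to the client's search method along with the index name; either way, the ranked hits come back with their BM25 relevance scores attached.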